Larry Peng

Income Inequality

Using World Bank data, I examine whether there are indicators that can be used to predict a nation's level of income inequality.

All .csv data used in this project is from https://data.worldbank.org/

Initial Imports

Starting Approach

To begin, I am curious whether a nation's level of wealth is a good indicator of its income distribution.

Are wealthier nations more likely to have greater economic disparity? This might make sense, as wealthier nations may be home to larger corporations that concentrate wealth and create billionaires.

Or are the greatest income inequalities found in less wealthy nations? This might also make sense, as developing economies may offer greater investment opportunities to those who already hold wealth.

To approach this question, I will use GDP Per Capita as a nation's measure of wealth and the Gini Index as its measure of income inequality.

Data Cleaning

The World Bank .csv data for the two indicators records values for each nation by year. There are also irrelevant columns for Country Codes, Indicator Names, and Indicator Codes, plus an additional column at the end.

Additionally, since the World Bank does not always have updated values each year for some indicators, there are many NaN values and missing year entries for each nation.
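
The cleaning step described above might look roughly like the sketch below. The column names ("Country Name", "Country Code", and so on), the trailing "Unnamed" column, the helper name clean_indicator, and the tiny made-up table are assumptions for illustration, not the actual files used here.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for one World Bank indicator file (countries and values are made up).
raw = pd.DataFrame({
    "Country Name": ["Atlantis", "Borduria"],
    "Country Code": ["ATL", "BRD"],
    "Indicator Name": ["Gini index", "Gini index"],
    "Indicator Code": ["SI.POV.GINI", "SI.POV.GINI"],
    "1960": [np.nan, 41.0],
    "1961": [35.2, np.nan],
    "Unnamed: 6": [np.nan, np.nan],   # extra trailing column
})

def clean_indicator(df):
    """Keep only the country name and the year columns, indexed by country."""
    year_cols = [c for c in df.columns if c.isdigit()]
    return df[["Country Name"] + year_cols].set_index("Country Name")

gini = clean_indicator(raw)
print(gini)
```

Selecting year columns by name (rather than dropping the irrelevant ones one by one) keeps the helper usable across indicator files whose extra columns differ. The NaN entries are left in place at this stage.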

Next, I examine the shape of the data by year, looking for obvious outliers or strange shapes.

Immediately, we see that the GDP Per Capita data is heavily right-skewed.

To make this more friendly to work with, I will take its natural log as a new column in the GDP Per Capita dataframe.
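
A minimal sketch of the log transform, assuming a column named gdp_per_capita (a hypothetical name) and made-up values:

```python
import numpy as np
import pandas as pd

# Made-up GDP Per Capita values with a long right tail.
gdp = pd.DataFrame({"gdp_per_capita": [500.0, 2_000.0, 8_000.0, 60_000.0]})

# Natural log stored as a new column alongside the original values.
gdp["log_gdp_per_capita"] = np.log(gdp["gdp_per_capita"])

# Sample skewness before and after the transform.
raw_skew = gdp["gdp_per_capita"].skew()
log_skew = gdp["log_gdp_per_capita"].skew()
print(f"skew before: {raw_skew:.2f}, after: {log_skew:.2f}")
```

Keeping the original column makes it easy to report results in dollar terms later while fitting models on the log scale.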

Instead of immediately looking at the log GDP Per Capita and Gini Index of a nation at specific years, I want to first take world averages and see if there is an immediate association.

I create a new dataframe containing the average world GDP Per Capita and Gini Index for each year the World Bank has tracked.
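
One way to build such a table is sketched below, assuming each indicator is stored with one row per nation and one column per year. The frames, nation labels, and column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up wide tables: rows are nations, columns are years.
gdp = pd.DataFrame({"1990": [1000.0, 3000.0], "1991": [np.nan, 4000.0]}, index=["A", "B"])
gini = pd.DataFrame({"1990": [40.0, np.nan], "1991": [38.0, 30.0]}, index=["A", "B"])

world = pd.DataFrame({
    "avg_gdp_per_capita": gdp.mean(),   # column-wise mean, NaN entries skipped
    "avg_gini": gini.mean(),
    "n_gdp": gdp.count(),               # how many nations back each average
    "n_gini": gini.count(),
})
world.index.name = "year"
print(world)
```

The count columns are an addition of this sketch: they show how many nations contribute to each yearly average, which matters when some years have very few reported values.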

Plotting the scatterplot for these values, we immediately see a potential polynomial association.

One variable polynomial regression.

Looking at the graph, it seems like the polynomial should be of degree three or four. I split the data into training and test datasets to check for overfitting.
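
The sketch below illustrates the idea with NumPy only: fit degree 3 and degree 4 polynomials on a training split and compare R squared on the held-out points. The data is synthetic (a made-up cubic trend plus noise), and the split sizes and helper names are assumptions, not the values used in this analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(6.0, 10.0, 40)                          # stand-in predictor values
y = 35.0 + (x - 8.0) ** 3 + rng.normal(0, 0.5, x.size)  # made-up cubic trend plus noise

# Random 30/10 train/test split.
order = rng.permutation(x.size)
train, test = order[:30], order[30:]

def r_squared(y_true, y_pred):
    """Share of variance in y_true explained by y_pred."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

scores = {}
for degree in (3, 4):
    coeffs = np.polyfit(x[train], y[train], degree)     # fit on training data only
    scores[degree] = (
        r_squared(y[train], np.polyval(coeffs, x[train])),
        r_squared(y[test], np.polyval(coeffs, x[test])),
    )
    print(degree, scores[degree])
```

A test R squared that falls well below the training R squared as the degree increases would be a sign of overfitting.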

Applying the polynomial fitted on the training data to the test data, it becomes apparent that the residuals are not evenly distributed. At this point, I abandon the prospect of merely using world averages at certain years. Not only are there too few entries to consider, but some of the initial years (1960, 1961) have only one or two nations with both a GDP Per Capita and a Gini Index value.

Now, I hope to explore the relationship between GDP Per Capita and Gini Index for specific nations at specific years.

However, the existing data is difficult to work with. The data is separated into two dataframes and contains many missing values.

To solve this, I will iterate through the two dataframes year by year, creating a new dataframe with each row being a nation's GDP Per Capita and Gini Index at a specific year. Since I am interested in both GDP Per Capita and Gini Index, I will ignore instances where a nation had a year with one or more missing indicator inputs.
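
As an alternative to an explicit year-by-year loop, the same table can be sketched with pandas melt and merge. This is a swapped-in technique rather than the loop described above, and the frames, column names, and helper to_long are made up for illustration.

```python
import numpy as np
import pandas as pd

countries = pd.Index(["A", "B"], name="Country Name")
gdp = pd.DataFrame({"1990": [1000.0, 3000.0], "1991": [np.nan, 4000.0]}, index=countries)
gini = pd.DataFrame({"1990": [40.0, np.nan], "1991": [38.0, 30.0]}, index=countries)

def to_long(wide, value_name):
    """Turn a nation-by-year table into one row per (nation, year), dropping NaN."""
    long = wide.reset_index().melt(
        id_vars="Country Name", var_name="year", value_name=value_name
    )
    return long.dropna(subset=[value_name])

# The inner merge keeps only nation/year pairs where both indicators are present.
pairs = to_long(gdp, "gdp_per_capita").merge(
    to_long(gini, "gini"), on=["Country Name", "year"], how="inner"
)
print(pairs)
```

Here nation A in 1991 and nation B in 1990 are dropped because each is missing one of the two indicators.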

One Variable Linear Regression.

Check the conditions for inference with this model.

Linearity

From the residuals plot below, we see that the residuals are slightly unevenly distributed above and below the zero line. Fitted values from roughly 32 to 37 sit far closer to the zero line than larger fitted values. Even so, we will proceed with caution.

Constant Variability of Residuals

The residuals are relatively evenly distributed from left to right. We meet the condition of Constant Variability of Residuals.
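
A rough numeric companion to the residuals plot is sketched below: compute fitted values and residuals, then compare the residual spread for low and high fitted values. The data is synthetic and the median split is an arbitrary choice for illustration, not a formal test.

```python
import numpy as np

rng = np.random.default_rng(2)
log_gdp = rng.uniform(6.0, 11.0, 200)                   # made-up predictor values
gini = 55.0 - 2.0 * log_gdp + rng.normal(0, 5.0, 200)   # made-up response values

slope, intercept = np.polyfit(log_gdp, gini, 1)
fitted = intercept + slope * log_gdp
residuals = gini - fitted

# Split observations at the median fitted value and compare residual spread.
low = fitted <= np.median(fitted)
spread_low = residuals[low].std()
spread_high = residuals[~low].std()

print(f"mean residual: {residuals.mean():.3f}")
print(f"spread (low fitted): {spread_low:.2f}, spread (high fitted): {spread_high:.2f}")
```

Similar spreads on both sides are consistent with constant variability, while a large gap would suggest the residuals fan out.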

Normality of Residuals

We see that the residuals are normally distributed. We meet the condition of Normality of Residuals.

Independence of Observations

The observations are not independent. A nation's GDP Per Capita and Gini Index for one year is heavily influenced by its previous years' GDP Per Capita and Gini Index. However, we will still proceed with caution, as we are interested in the relationship between the two values by year and by nation.

Conclusion

With an R squared of 0.155, there seems to be a weak linear relationship between a nation's GDP Per Capita and Gini Index. Since Log GDP Per Capita had a negative coefficient, they have a weak negative linear relationship.
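
For reference, R squared and the sign of the slope in a one-variable regression can be computed as in the sketch below. The data is synthetic, so the numbers it prints are not the results reported above.

```python
import numpy as np

rng = np.random.default_rng(3)
log_gdp = rng.uniform(6.0, 11.0, 500)                   # made-up predictor values
gini = 50.0 - 1.5 * log_gdp + rng.normal(0, 5.0, 500)   # made-up weak negative trend

slope, intercept = np.polyfit(log_gdp, gini, 1)
predicted = intercept + slope * log_gdp

ss_res = np.sum((gini - predicted) ** 2)       # unexplained variation
ss_tot = np.sum((gini - gini.mean()) ** 2)     # total variation
r_squared = 1.0 - ss_res / ss_tot

# With a single predictor, R squared equals the squared correlation.
corr = np.corrcoef(log_gdp, gini)[0, 1]
print(f"slope: {slope:.2f}, R squared: {r_squared:.3f}, corr^2: {corr ** 2:.3f}")
```

In the one-variable case, the sign of the slope matches the sign of the correlation, which is why a negative coefficient with a low R squared reads as a weak negative linear relationship.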

More Questions Arise?

Even if GDP Per Capita is a relatively weak predictor for Gini Index, would adding more indicators in a multiple variable linear regression lead to a stronger relationship?

I decided to add Adult Literacy Rate (as % of those 15 and above), Birth Rate (per 1000 people), Electricity Access (as % of population), Female Labor (as % of total labor force), Government Debt, Inflation (annual %), Life Expectancy (years), and Population as additional indicators.

Aggregate and transform the individual csv dataframes into rows for specific year/nation

Multiple Variable Linear Regression

Check the conditions for linear regression

Linearity

From the residuals plot below, we see that the residuals are relatively evenly distributed above and below the zero line. We meet the condition of Linearity.

Constant Variability of Residuals

The residuals are relatively evenly distributed from left to right. We meet the condition of Constant Variability of Residuals.

Normality of Residuals

We see that the residuals are relatively normally distributed (slight left skew). We proceed with caution.

Multicollinearity

Since we are now working with more than one independent variable, we want to ensure that there do not exist any strong associations between two or more of these independent variables. From the variance inflation factor table calculated below, there are many variables that may have linear relationships with each other.
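
For reference, the variance inflation factor of a variable is 1 / (1 - R squared), where R squared comes from regressing that variable on all of the other independent variables. The sketch below computes it with NumPy on synthetic data. The function name vif and the example columns are made up, and statsmodels offers variance_inflation_factor in statsmodels.stats.outliers_influence as a library alternative.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (intercept included in each fit)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(1)
a = rng.normal(size=300)
b = a + rng.normal(scale=0.1, size=300)   # nearly a copy of a
c = rng.normal(size=300)                  # unrelated to a and b
vifs = vif(np.column_stack([a, b, c]))
print(vifs)
```

In this made-up example the two nearly identical columns get very large VIFs while the unrelated column stays close to 1.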

GDP Per Capita, Literacy Rate, Electricity Access, Female Workforce Percentage, Life Expectancy, and Population all have very high VIFs. This makes sense, as many of these indicators move together with national wealth: wealthier nations typically have higher literacy rates, wider access to electricity, and longer life expectancy.

I did a poor job of selecting variables. There exists multicollinearity!

To reduce multicollinearity, I create a reduced model that keeps only one of these high-VIF variables (log GDP Per Capita).

Multicollinearity

Checking the VIFs again with the reduced model, we see that all are relatively low. We meet the no Multicollinearity condition. We proceed with caution regarding the independence of observations condition for the same reason as before.

Conclusion

With an R squared of 0.046, adding the two variables of Fuel Export Percentage and Inflation Rate produced a linear model that fits worse than the one-variable model (R squared of 0.155).

Another Question Arises

What if I do a better job of selecting indicators to examine?

I will look at the economic makeup of a nation, considering indicators like Goods and Services Imports (as % of GDP), Goods and Services Exports (as % of GDP), High Tech Exports (as % of manufactured exports), International Tourism Expenditures (as % of total imports), Ore/Metal Exports (as % of merchandise exports), and Fuel Exports (as % of merchandise exports).

The reasoning behind this is that a nation's exports typically serve as a good snapshot into the nation's economy. Reading about states like the UAE in the news that are known for stark income disparities, I wonder if economies that rely on natural resource exports (like Fuel and Ore/Metal) are more likely to have stark economic disparities. The logic behind that hypothesis could be that it is easier for a few individuals to control oil production and harder for new players to secure the infrastructure/rights to compete.

Earlier Multicollinearity Check

This time around, I will check for Multicollinearity first, to know which variables to remove before proceeding with the regression.

We see below that Goods and Services Imports and Goods and Services Exports have relatively high VIFs in the 20s. This makes sense, as trade-oriented economies tend to have both high imports and high exports relative to GDP. In my reduced model, I will remove Imports and only consider Exports.

Multicollinearity

With these new VIF values, we meet the condition for no multicollinearity.

Perform Multiple Variable Linear Regression

Check Conditions for Linear Regression

Linearity

We see that the residuals are relatively evenly distributed above and below the zero line. Ignoring what seem to be outliers for fitted values roughly below 25 and above 48, we can proceed with caution.

Constant Variability of Residuals

We see that the residuals are relatively evenly distributed from left to right. The residuals seem to be in the shape of a circle, but we will proceed with caution.

Normality of Residuals

The histogram of residuals seems relatively normal. We meet the condition of Normality of Residuals.

Testing the Model

Now we use the same model from the training data on the test data to check for overfitting.
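
The check described above can be sketched as follows: fit the coefficients on the training rows only, then score the same coefficients on the held-out rows. The three predictors, their coefficients, and the 800/200 split are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
X = rng.normal(size=(n, 3))                                   # made-up predictors
y = 38.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 1.0, n)

order = rng.permutation(n)
train, test = order[:800], order[800:]

def design(X):
    """Prepend an intercept column."""
    return np.column_stack([np.ones(len(X)), X])

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Coefficients come from the training rows only.
beta, *_ = np.linalg.lstsq(design(X[train]), y[train], rcond=None)

train_r2 = r_squared(y[train], design(X[train]) @ beta)
test_r2 = r_squared(y[test], design(X[test]) @ beta)
print(f"train R squared: {train_r2:.3f}, test R squared: {test_r2:.3f}")
```

A test R squared close to the training R squared suggests the model is not overfitting the training rows.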

Conclusion

The residuals of the model and test data seem relatively evenly distributed and normal.

With this, we see that using GDP Per Capita, Goods and Services Exports, Technology Exports, Ore/Metal Exports, Fuel Export Percentage, and Tourism Expenditure in a multiple variable linear regression model for Gini Index gives a relatively weak linear relationship, with an R squared of 0.22. (Stronger than with just GDP Per Capita!)

The model provides interesting insights.

From the model we see that Technology Exports and Fuel Exports have high p-values, which may suggest that these two variables are not that important in determining Gini Index. On the other hand, all of the other variables have incredibly low p-values.

GDP Per Capita, Goods and Services Exports, and Tourism Expenditures have negative coefficients. This suggests that, holding the other variables fixed, nations with higher levels of those three variables tend to have lower levels of income inequality.

Technology, Ore/Metal, and Fuel exports all have positive coefficients. This may suggest that nations with higher levels of these three variables have higher levels of income inequality.

Multiple Variable Polynomial Regression

Maybe the shape of the relationship is polynomial rather than linear?
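
A minimal sketch of what a multiple variable polynomial fit involves: expand the predictors into all terms up to degree 2 (squares and pairwise products), then run ordinary least squares on the expanded columns. The helper names poly_features and fit_r2 and the synthetic data are assumptions for illustration.

```python
from itertools import combinations_with_replacement
import numpy as np

def poly_features(X, degree=2):
    """Columns for the intercept and every product of up to `degree` predictors."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    cols = [np.ones(n)]
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(range(k), d):
            cols.append(np.prod(X[:, list(combo)], axis=1))
    return np.column_stack(cols)

def fit_r2(A, y):
    """Least-squares fit of y on the columns of A, returning R squared."""
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(5)
X = rng.normal(size=(400, 2))                                  # two made-up predictors
y = 30.0 + 2.0 * X[:, 0] ** 2 - X[:, 1] + rng.normal(0, 0.5, 400)

r2_linear = fit_r2(np.column_stack([np.ones(len(X)), X]), y)
r2_quadratic = fit_r2(poly_features(X, degree=2), y)
print(f"linear: {r2_linear:.3f}, quadratic: {r2_quadratic:.3f}")
```

Because the number of columns grows quickly with the degree and the number of predictors, training R squared can only go up as terms are added, so a held-out test set is the meaningful comparison.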

Test for Overfitting

Using the same model we created with the training data, we predict values within the test data to check for overfitting. The result is a relatively evenly distributed residuals plot.